DataFrames are a lovely way to store data. They're essentially matrices that can store almost any type of data and are a great option for handling data where you want to keep track of rows and columns with labels. The Pandas library also comes with a handy host of functions that allow you to work with your DataFrames in very smart ways (eg. means, split-apply-combine).
The Pandas website has a lot of great documentation to help you get started. However, even after reading the official tutorials, I was stumped on some problems using DataFrames. This lesson is based directly off of how I solved those problems, so hopefully some will find it helpful.
In [1]:
import pandas as pd
import numpy as np
blank = pd.DataFrame()
blank
Out[1]:
In [2]:
movies = pd.DataFrame(np.zeros((4,4)), index=['Forrest Gump', 'Scanners', '2010: Odyssey Two', 'Fern Gully'], columns = ['Date Released', 'Box Office Gross', 'IMDB Score', 'Tomatometer'])
movies
Out[2]:
In [3]:
zeros = pd.DataFrame(np.zeros((3,5)))
zeros
Out[3]:
Got a csv file you want to read? No problem!
In [4]:
probedata = pd.read_csv('12probe20cm.csv')
probedata[1:4]
Out[4]:
In [5]:
movies['Date Released'] = ['1994','1981','1984','1992']
movies['Box Office Gross'] = [50,12,8,16]
movies['IMDB Score'] = [8.8,6.8,6.8,6.4]
movies['Tomatometer'] = [.72,.80,.66,.71]
movies[1:2]
#OR
movies.iloc[1:2]
Out[5]:
In [6]:
movies[['IMDB Score','Tomatometer']]
Out[6]:
In [7]:
movies[movies['IMDB Score']<(movies['Tomatometer']*10)]
Out[7]:
Columns can be added by setting a new column to another DataFrame. Just make sure that the indices are compatible!
In [8]:
favlist = pd.DataFrame([True,True,True,True],index=movies.index)
movies['Childhood Top 10?'] = favlist
movies
Out[8]:
Rows can be added by using the append function on a DataFrame, taking an appropriately index and columned DataFrame that will be added on the end of the first DataFrame.
In [9]:
wildwildwest = pd.DataFrame(index=['Wild Wild West'], columns=movies.columns)
wildwildwest.iloc[0] = ['1999', 2, 4.8,.17,False]
movies.append(wildwildwest)
Out[9]:
The concat function is also very useful. It can handle DataFrames with different indices and/or columns. There are multiple ways to joining the indices and columns however you would like
In [11]:
pd.concat([movies,movies])
Out[11]:
Pandas.GroupBy is a great function that allows you to process your data in many different ways without having to get fancy or write any loops.
Essentially, it carries out three different steps:
Splitting the data into different groups [eg. treatment condition]
Applying some function to your data [eg. mean]
Combining the results back into another DataFrame
In my work, this function was extremely useful when I wanted to obtain the mean time spent in target zone for each of my treatment groups
In [12]:
probedata
Out[12]:
In [13]:
grouped = probedata.groupby(['Group']).mean()
grouped
Out[13]:
In [14]:
import matplotlib.pylab as plt
import seaborn as sns
sns.set_context("talk")
grouped[['Zone 1 %','Target Zone %','Zone 3 %','Zone 4 %']].plot(kind='bar')
plt.title('Time spent in Zone')
plt.ylabel('Time (sec)')
plt.xlabel('Zone')
locs, labels = plt.xticks()
plt.setp(labels,rotation=0)
plt.show()
In [ ]: